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Simple derivations, at a level appropriate for an undergraduate computational physics course, of 
the most popular methods for finding the minimum of a function of many variables are presented 
. in a unified manner in the context of a general optimization scheme that emphasizes their essential 

' similarities. The derivations in this paper encompass the conjugate gradient methods with and 

without conditioning, and the variable metric methods. The common variants of these methods 
including Fletcher-Reeves, Polak-Ribiere, Davidon-Fletcher-Powell, and Broyden-Fletcher-Goldfarb- 
Shanno are described and motivated. 

m ' 

Ph' I. INTRODUCTION 

You are near the top of a long valley and are meeting your friend at the valley's lowest point for a picnic. You 
could probably see the lowest point just by looking around and could walk right to it. To make things challenging you 
have decided to find the minimum in the same way that a computer would. First you blind-fold yourself, since the 
information that you get from your eyes, which includes the heights of a huge area around your location, would be far 
O ' too costly for a computer to calculate. To find the lowest point in the valley while blindfolded you would probably 
! walk down hill, continuously changing your orientation to be always pointing in the direction of steepest descent, 
which you can determine with your feet alone. The path you would take is the same path that a ball would take if 
' velocity was proportional to the force as in Aristotilian dynamics. Regardless, this method is also very costly for a 
computer, since it requires determination of the steepest descent direction at every point along one's trajectory. So 
to make your path more closely match what a computer would do you have procured a magic scooter. You control 
i— H the initial orientation of the scooter, hop on, and away you go. The scooter takes a straight line path and stops 
at the point where to go any farther would take you up hill 1 . This is almost never the lowest point in the valley. 
When the scooter stops you can change its orientation and take another ride. This curious way of going downhill 
actually resembles the procedure many computer algorithms use to find the minimum of a function of many variables. 
Different methods vary only in their choice of which way to orient the scooter at each step, and the method by which 
the scooter finds the local minimum along the ray determined by the scooter's orientation. Finding a minimum along 
' the length of the ray is the problem of finding the minimum of a function of one variable. Routines for doing this are 
described in several popular texts 2 . In this paper we focus on the many variables aspect of the problem, the choice 
of which way to point the scooter, and in particular those choices that correspond to conjugate gradient and variable 
O | metric minimization. 

* 55 ' Conjugate gradient and variable metric (sometimes called quasi-newton) optimization methods are very popular and 
£>V successful algorithms for finding minima of functions of many variables 3 . In this paper we motivate these optimization 
tIh • methods, and derive them in a unified manner in a way that emphasizes their essential similarity. The derivation in this 
Oh- paper encompasses the conjugate gradient method with and without conditioning, and the variable metric method. 
The common variants of these methods including Fletcher- Reeves 4 , Polak-Ribiere 5 , Davidon-Fletcher-Powell 6 , and 
Broyden-Fletcher-Goldfarb-Shanno 7 are described and motivated. In section II we describe what is wrong with the 
steepest descent method. In section III we derive a generalized optimization routine which is the starting point for 
all of the algorithms mentioned above. In section IV we take up some convergence proofs which are followed up in 
the Appendices. In section V we focus on conjugate gradient routines with and without conditioning and in section 
VI we turn to variable metric methods. 
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II. WHAT'S WRONG WITH STEEPEST DESCENT? 



In this section we introduce numerical minimization of a function of many variables and show why the steepest 
descent minimization method is not an optimal choice. Consider a function -E(x) of m unconstrained real variables 
collectively denoted by the m-component vector x. For concreteness you can consider the function to be the energy 
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of a configuration of ions. We seek the position of the ions for which the energy is minimized. As with finding the 
lowest point in a valley, we can think in terms of orienting and moving a single magic scooter but now the scooter 
moves in an m-dimensional space in which each point corresponds to a configuration of all of the ions. For example, 
if there are two ions labeled by coordinates (#1,2/1, z\) and (x 2 ,y2, ^2) the scooter moves in the six dimensional space 
x = (xi,yi,Zi,X2,V2, z 2) ■ In general the n+1 guess for the location of the scooter is determined by 

x n +i =x„ + A„h„ (2.1) 

where h„ is a vector, different for each minimization scheme, that points along the direction the scooter is headed 
and A„ is a positive scaler that represents how far the scooter moves, in units of h„|. Once h„ is chosen we find x n +i 
by varying A„ to minimize E(x n+ i). The minimum is found where the derivative of the energy E with respect to A„ 
is zero, 

dE 

— = V£(x„+i) • h„ = 0, (2.2) 

or by multiplying by A„ 

V£(x„+i) • (x„+i - x„) = 0. (2.3) 

This result tells us that the gradient, Vi?(x n +i), at the new minimum x n+ i is perpendicular to the descent direction, 
x„ + i — x„, that took us to the minimum. This result is easy to visualize. The dot-product of the gradient of the 
energy with a particular vector represents the change in energy along that vector. The scooter stops when this is zero 
since that indicates that the energy is no longer decreasing in that direction. 

Numerical minimization schemes differ in their choice for h„. A naive choice is the steepest decent direction at x„ , 

h„ = -V£(x„). (2.4) 

As well described in standard texts 3 , this choice turns out to be far from optimal since it forces each new decent 
direction to be orthogonal to the previous one. The path to the minimum using this method is a series of 90 degree 
jogs and it can take many steps of this sort to get close to the true minimum. Consider, for example, a valley with 
a long shallow axis and a short steep axis shaped like a canoe or a long wooden stirring spoon. Start the scooter 
somewhere away from the long axis. The steepest descent direction will point somewhere between the shortest path 
to the long axis, and the direct path to the minimum. Most people assume that the scooter will stop at the long axis 
and the next move will be along the long axis to the minimum. But the lowest point along the ray must overshoot 
the long axis by some amount for the next steepest descent direction to be 90 degrees from the path just taken. This 
is because the energy lost by going parallel to the long axis isn't overtaken by the gain in going up the steep side until 
some distance beyond the long axis. For this reason, the steepest descent path constantly overshoots, taking many 
short 90 degree jogs along the long axis on the way to the minimum, when only two steps appear to be necessary. 
The way out of this dilemma, adopted by the conjugate gradient and variable metric methods, is to use some other 
direction than steepest descent for the descent direction. This is what wc turn to in the next section. 

III. A UNIVERSAL OPTIMIZATION PROCEDURE 

In this section we improve upon the steepest descent method and develop a universal framework for optimization. 
The basic idea is to pick a descent direction that has some positive projection onto the steepest descent direction 
but also incorporates information about previous moves into our n+1 guess. A natural choice, as we will see below, 
for the n+1 guess constructs the direction vector from a linear combination of all previous displacements plus a new 
vector v n that incorporates new information about E at x n , 

n 

t,, h l = X„ + A„ [ V„ + ^2 °j,n(Xj - Xj_i) I . (3.1) 

3=2 

For conjugate gradient minimization v„ = —VE(x n ). For conjugate gradient with conditioning or variable metric 
minimization v„ = —H n \7E(x n ) where H„ is an appropriately chosen symmetric matrix called the conditioner if it 
is independent of n or the variable metric if it depends on n. The variables A„ and a^ n are varied either analytically 
or numerically to minimize the function in question. We will minimize E analytically with respect to the aj.„'s. E 
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is then a function of the single variable A„ and can be minimized numerically using a routine to find the minimum 
of a function of one variable. Choosing v„ and the aj t „ beforehand determines h„ and is like orienting the scooter. 
Determining A„ numerically is like taking a scooter ride. 

When we take the derivative of the energy E with respect to A„ at the point x n+ i and set it to zero, we get Eq.(2.3), 
as before, but now V-B(x„ + i) is not the next descent direction so this condition does not constrain us to right angle 
turns as was the case for steepest descent minimization. 

We get n-1 new conditions when we take the derivative of the energy E with respect to Oj> at the point x n +i and 
set it to zero (In practice it is always clear that we are dealing with the n+1 guess so we have throughout the rest of 
this paper suppressed the second index in the aj. n .), 

dE 

— = A„V£(x„+i) -(xj -xj_i) = 2<j<n. (3.2) 

Divided by A„ and combined with Eq.(2.3) we have 

V^(x n+1 )-(x j -x j _i)=0 2<j<n+l. (3.3) 

which tells us that the gradient at step n + 1 is orthogonal to all previous displacements. These relations will be a 
great aid in simplifying expressions. 

We will get back to Eq.(3.2) shortly but first we need to express V.E(x n+ i) in terms of the a/s and known quantities. 
To do so we assume an explicit form for E(x). We develop our optimization scheme assuming that our guess for the 
minimum is close enough to the true minimum x m i n that we can approximate E(x) by its Taylor expansion to second 
order around its true minimum. 

E(x) = £(x min ) + i(x - x min ) • A • (x - x min ), (3.4) 

where the first order term is zero since we are expanding E(x) around its minimum. The m x m symmetric constant 
matrix A is the diadic second derivative of E(x.) evaluated at x m i n with matrix elements Aij = d 2 E(x m i n )/dxidxj. 
For a unique minimum, A must have only positive eigenvalues. 

The quadratic form for the energy Eq.(3.4) allows us to find an explicit expression for the gradient 

VS(x) - A • (x - x min ) = (x - x min ) • A. (3.5) 

However, for this to be useful we must express it in such a way that x m ; n and A do not explicitly appear since they 
are unknown. We can get rid of x m ; n by subtracting the gradients at any two points in configuration space, 

VB(x) - VE(y) = A • (x - y) = (x - y) • A. (3.6) 

This is an incredibly useful formula since it allows us to turn any expression with A, which we don't know, into 
quantities we do know as long as we can arrange for A to operate on a displacement. This is the reason why including 
previous displacements in our n+1 guess turns out to be the natural way to incorporate information from previous 
moves. We now have what we need to manipulate Eq.(3.2) to find the n-1 aj's. 

First we use Eq.(3.6) to find an explicit expression for Vi?(x„ + i) in Eq.(3.2) by setting x = x„ + i and y = x„, 

V£(x n+1 ) = V£(x„) + (x 

n+1 ) ' 

(3.7) 

We now use this expression for V-B(x„ + i) in Eq.(3.2) to get, after dividing by A n , 

V.B(x„) • (xj — Xj_i) + (x n+ i — x n ) • A • (xj — Xj_i) = 2<j<n. (3.8) 

The first term in Eq.(3.8) is zero via Eq.(3.3) with n+1 replaced with n. We are left with 

(x„+i - x„) • A • (xj - Xj_i) = 2<j<n. (3.9) 

We can now, as promised, get rid of the unknown A via Eq.(3.6) since it is operating on a displacement. 

(x n+1 - x„) • (V£( Xj ) - V-E^-i)) - 2 < j < n. (3.10) 

When we put the formula for x n+ i, Eq.(3.1), into Eqs.(3.10) we get, after dividing by A„, what we have been after 
since the beginning of this section: n-1 equations for the n-1 aj 's in terms of known quantities from previous steps, 
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^v„ + X> fe (x fc -x fe _ 1 )J ■(V£(x J -)-V£7(x J -_i)) = 2<j<n. (3.11) 

By inverting these equations we can find a direction to orient our scooter that is closer to optimal the closer we are 
to the minimum and the closer the quadratic approximation is to E. This is already more than we had any right 
to expect but it gets even better! These n-1 equations are uncoupled since by combining Eq.(3.6) with Eq.(3.9) and 
using the fact that A is symmetric, we have 

(x fe - x fe _i) • A • (xj - Xj_i) = (x fe - x fe _i) • (VE(xj) - VE(x j _ 1 )) 

= (V%) - VB(xfc_i)) • (x 3 - - Xj-_i) 

= 5 jlk (VE{ Xj ) - V%_!)) • ( Xj - x,_i) (3.12) 

so that all terms in the k-sum of Eq.(3.11) drop out except the k=j term. The resulting formula for aj is 

„ v„ ■ (Vg(xj) - V^(x 3 -_i)) 

J (V^x^-V^x^))^^-^)' 1 • J 

Plugging this into the update formula Eq.(3.1) we obtain the expression that all of the optimization schemes take as 
a starting point. 

This is the most important equation in this paper. Conjugate gradient and variable metric routines use this formula 
and only differ in their choice of v„ . 



IV. CONVERGENCE 



Of course, none of this is worth it unless Eq.(3.14) is demonstrably faster than steepest descent. And it is, at 
least for those cases where one is close enough to a minimum that the second order Taylor expansion for E is a good 
approximation. For quadratic energies, this updating scheme will converge to the minimum of E in no more steps 
than the number of components of x (two for the canoe, not including the first guess), provided that the v„'s are 
chosen to be linearly independent of each other. The proof is straightforward. Consider the m + 1 step of Eq.(3.14). 
As long as v TO and the m — 1 (xj — Xj_i)'s are linearly independent, which will be true if all the v„'s are linearly 
independent, then by varying A m and the m — 1 Oj's, x m+ i spans the entire configuration space. Since A m and all 
the dj's are minimized exactly (remember the a/s are exact only for quadratic functions) x TO+ i is guaranteed to be 
the minimum. So, our intuition that it should only take two steps to find the minimum of a canoe is correct, as long 
as we have a quadratic canoe. 

We can achieve the minimum in fewer than m steps with a clever choice of v„, provided that E is exactly given 
by Eq.(3.4). For example, the steepest descent method, vi = — V-E(xi), finds the minimum in one step regardless 
of the dimensionality of the configuration space for the maximally symmetric case of quadratic functions with A 
proportional to the identity matrix (a circular parabolic bowl is an example). More generally one step minimization 
for quadratic functions is always possible with 

Xmin = X„ - A" 1 • VE(x n ), (4.1) 

found by inverting Eq.(3.5). This motivates choosing v„ = — H„ • VE(x n ) where the matrix H„ is a possibly different 
symmetric positive definite matrix at each step and is ideally a good approximation for A -1 . Different choices for 
H n correspond to different conjugate gradient and variable metric routines. 

In the appendix we show that optimization of quadratic functions with appropriate choices for H„ converges to the 
minimum of E in no more steps than the number of distinct eigenvalues of HiA, which can be much less than m for 
highly symmetric systems (notice that this agrees with 1 step minimization when Hi is proportional to A -1 ). In the 
Appendix we also show that these schemes automatically construct a better approximation for A -1 at each step so 
the optimal choice for A„ approaches 1 for large n, regardless of how poorly Hi initially approximates A -1 . 

Even if the energy is not well approximated by a quadratic function, positive definite H„ still guarantees that the 
positive A„ direction points down hill. The proof is straightforward. Downhill is determined by 
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V£(x n ) • (x„+i - x„) < 0. 



(4.2) 



Replacing the displacement x n+ i — x„ with Eq.(3.14) and using Eq.(3.3) to remove all terms in the j sum we have 

-A n V£(x n ) • H„V£(x„) < 0, (4.3) 

which guarantees that E decreases in the positive A„ direction, at least initially, provided H„ is positive definite. 

If the energy function is not quadratic, convergence to the minimum is not guaranteed for any finite number of 
steps since the a/s given by this procedure are no longer the true minima. Regardless, these methods are generally 
better than steepest descent, since smooth functions are generally closer to quadratic the closer one is to the minimum 
so the approximate Oj's get progressively closer to their true values. Many implementations of these methods switch 
over to steepest descent minimization in the event that x„ is in a region that is sufficiently far from quadratic. 



V. CONJUGATE GRADIENT OPTIMIZATION 



We are now ready to derive conjugate gradient optimization formulas. This section will be short since we have 
already done most of the work we need to do. Conjugate gradient optimization is just the general optimization 
procedure, Eq.(3.14) with v„ = — H • VE(x n ) where H, called the conditioner, is an appropriately chosen symmetric 
positive definite matrix. This is motivated by the exact result in case of quadratic functions Eq.(4.1). If one has a 
good estimate for A -1 then one can use it for H. Often no such information is known and the identity matrix is used 
for H. The updating formula is 

x -x A (lIVE(x) Vfx x ^V^J-V^-i^H-V^) ^ 

X " +1 - X " - [ U VEM - 2>' - X - l) (x,-x J _ l) -(V^x,)-V^_ l) ) J • (5 ' 1} 

The conjugate gradient choice for v„ has the wonderful feature that it makes all terms in the j sum in Eq.(5.1) 
equal to zero except for the j=n term. To see this note that 

V £(x „> . H • V £( x, - V £( x„> • f H • VE^ - £<x t - ^,) g^^L|^ ) , (5,) 



k=2 



for j < n, where we have added terms that are zero via Eq.(3.3). Now using Eq.(5.1) we can replace the terms in 
parenthesis with the displacement so we get 

V£(x n ) • H • VE( Xj ) = V£(x„) ■ (xj+i - x,)/A,, (5.3) 

which by Eq.(3.3) is zero for j < n — 1, so that we have the result 

VS(x n )-H-V%) = 2<j<n-l (5.4) 

Eq.(5.4) makes all terms in the j sum zero in Eq.(5.1) except for j = n. We are left with 

x„ +1 . x„ - k (h ■ ve M - (x. - ^_ l) (v E (x.)- v^-,))-H.y E (x„n 

V (x n -x„_i) • (V£(x„) - V-E(x„_i))y 

This expression for x„ + i is the Polak-Ribierc conjugate gradient updating formula (However, see Appendix A). One can 
come up with dozens of other formulas simply by adding or subtracting terms that are zero if the function is quadratic. 
Another popular choice is the Fletcher-Reeves updating formula which, using the fact that V£(x n _i) -H- V_E(x„) = 
from Eq.(5.4), simplifies Eq.(5.5) to 

^ ( TT ^ I , V£(Xn) ■ H ■ V^(x») \ 

x„+i = x„ - A„ H • V£ x„ - x„ - x n _i)- v ' — - — — 5.6) 

V (x„ -x„_i) • (V£(x n ) -V£(x n -i)).y 

The Polak-Ribierc and the Fletcher-Reeves formulas are equal if the energy function is quadratic. If the 
energy function is not quadratic it is possible for the Polak-Ribiere numerator to be negative, while the 
Fletcher-Reeves numerator is always positive. Empirically the best form for the numerator appears to be 



5 



max ((V£(x n ) — V_E(x„_i)) • H • V_E(x„), 0) which uses the Polak-Ribiere numerator but restarts the conjugate gra- 
dient routine with a steepest descent step (assuming the conditioner is the identity) if the P-R numerator goes 
negative. 

The conjugate gradient update formulas are swell but they don't exactly look much like the one step formula 
Eq.(4.1) that we might have expected them to resemble. This can be remedied by taking advantage of the ability to 
create matrices by forming the outer product of two vectors. For the Polak-Ribiere formula we can write 

x„+i — x„ - A n \ ti — — — • Vii(x„j, (5.7) 

V (x„ - x n _i) • (V£(x n ) - V£(x n _i)) J 

in which (^) connotes the outer product. A similar expression holds for the Fletcher- Reeves update formula. This 
change to the form of the descent direction is entirely cosmetic, and is given primarily to show the similarity with 
Eq.(4.1). However, Eq.(5.7 suggests the possibility that the expression in parenthesis, modified to exhibit symmetry, 
positive definitcness and other properties we expect of an approximation for A -1 , would be an improvement over H 
that could profitably be used in the next update in its place. This is the motivation for variable metric minimization 
which is discussed in the next section. 

VI. VARIABLE METRIC OPTIMIZATION 

For variable metric optimization we choose v„ = H„ • \7E(x n ) where H„ is an appropriately chosen symmetric 
positive definite matrix that is updated at each step to more closely resemble A -1 . The initial Hi is often the identity 
matrix. A -1 has the useful property determined by the inverse of Eq.(3.6) that 

x - y = A" 1 • (V£(x) - V£(y)) = (V£(x) - VE(y)) ■ A" 1 , (6.1) 

which is good for any x and y. Since H„ is supposed to be a good approximation for A -1 we will require that H„ is 
symmetric and satisfies Eq.(6.1) for x and y chosen from the set of for j < n. This will be so if we require 

H„ • (WE(xj) - V£(x j _i)) =Xj --x.j-1 2<j<n. (6.2) 

For an H„ satisfying Eq.(6.2) the update formula 

X " +1 - X " - A ^ ' VEM - 2-,^ - X - l} (V^x,)-Vi ? (x,_ 1 ))-(x,-x,_ 1 ) ) (6 - 3) 

simplifies to 

x„ +1 = x„ - A„H„ • V£(x„) (6.4) 

via Eqn.'s (6.2) and (3.3). The resulting expression is naturally in the form resembling Eq.(4.1) so H„ is acting like 
an approximate A -1 . 

There are many ways to construct an H„ that satisfies Eq.(6.2). An interesting formula results if we choose 
v„ = — H„_! • \7E(x n ) but insist that we still end up with Eq.(6.4) for x„ +1 . For this to be true we must have 

„ , „ -„ ( , , s (Vg(xJ - V^(x„_ 1 )) • H n _! ■ V£(x») 

H„ • VB x„ = H„_i • V£(x„) - (x„ - x n _i) — - — — — — — (6.5) 

(V£(x n ) - VE(x n _i)) • (x„ - x„_i) 

An H„ that satisfies this is 

„ _„ (Xn - ®H„-1 • (Vg^Q - V^(x»_ 1 )) 

" " _1 (V£;(x„)-V J B(x„_ 1 )).(x„-x„_ 1 ) ' ^ 

which is an iterative form of Eq.(5.7). However, this H„ is a bad approximation for A -1 since it is neither symmetric 
nor does it satisfy Eq.(6.2). This can be remedied by amending H„ with terms that give zero for any E, quadratic or 
not, when operating on V£ l (x„); 

H =R (x» - x ra _x) 0H n _! ■ (V£(x») - V.E(x n _ 1 )) H n _! ■ (Vg(xn) - V^(x„_ 1 )) (g)(x„ - x^ 

" _1 (V£(x„) - VS(x„_i)) • (x„ - x n _!) (V£(x„)- VE(x n _ 1 ))-(x n -x B _ 1 ) 

+ a(x n -x„_i)(g)(x„ -x„_i). (6.7) 



The next to last term was added to make H„ symmetric. The last term is allowed with any value of a. Remarkably, 
there is a special value: 

1 + (V£(Xn) - V£(Xn-i)) ■ H„_! • (V£(x») - V^(x n _ 1 )) \ g 



(V£(x tl ) - V£(x„_i)) • (x„ - x„_i) V (V£(x n ) - V£(x„_i)) • (x„ - x„_i) 

for which Eq.(6.2) is automatically satisfied, so that we are improving our approximation for A -1 at every step. With 
this value for a Eq.(6.7) is the Broydcn-Fletcher-Goldfarb-Shanno variable metric updating formula. The proof that 
H„ is positive definite, needed to guarantee that the positive A„ direction points downhill, is left for the Appendix. 

Another popular variable metric update formula, the Davidon-Fletcher-Powell scheme, was developed with only 
Eq.(6.2) and symmetry in mind, with no regard for Eq.(6.5) (although it turns out to satisfy it as a proportionality 
rather than an equality as we show in the Appendix). It is 

H =R H ra _x ■ (V£(x„) - V£(xn_i)) ® H n _i ■ (V£(x») - Vg^-i)) 
" _1 (V£(x„) - V£(x„_ 1 )) ■ H„_! • (VS(x„) - V£(x n _!)) 

(x» - x»_i) (g)(x» - X»_l) 
(V£(x n ) - V£7(x„_i)) • (x„ - x„_!) " [ ' ' 

This is an arbitrary formula in that we can add the term bu ® u to H„ with 

= (Xn -Xn_i) H n _i • (V-E(Xn) - V£(x„_i)) 

U (VE(x n )-V£(x n _ 1 ))-(x„-x n _ 1 ) (V£;(x„)-V J B(x n _ 1 ))-H„_ 1 -(V J B(x n )-V£;(x„_ 1 )) { ' } 

for any b and still obey Eq.(6.2). The choice 

b = (V£(x n ) - V^(x„_!)) ■ H„_! • (V£(x„) - V£(x„_!)) (6.11) 

corresponds to the BFGS scheme. It also is the only choice that makes H„ linear in H n _i. The Davidon-Fletcher- 
Powell scheme, for any b, is exactly equivalent to the Broyden-Fletcher-Goldfarb-Shanno scheme if the energy function 
is quadratic. Empirically, for more general nonquadratic energies, the Broyden-Fletcher-Goldfarb-Shanno scheme is 
usually better than the 6 = Davidon-Fletcher-Powell scheme. 

For quadratic energy functions variable metric and conjugate gradient methods are equivalent. The proof is in 
the appendix c. Variable metric minimization differs from conjugate gradient minimization for non-quadratic energy 
functions. 

The variable metric method requires more memory than the conjugate gradient method since for variable metric 
optimization one is building up an m x m matrix. In practice, for large m, one can cut this way down by keeping 
information from only the last q < m steps 8 . My experience has been that for nonquadratic functions the variable 
metric method with q > 1 converges faster than the conjugate gradient method (equivalent to variable metric with 
q = 1) but past performance is no guarantee of future returns. 



APPENDIX A: WILL THE REAL CONJUGATE GRADIENT FORMULA PLEASE STAND UP? 

Eq.(5.5) is the Polak-Ribiere conjugate gradient update formula and Eq.(5.6) is the Fletcher- Reeves formula. How- 
ever, the conventions for conjugate gradient minimization and variable metric minimization are different. In this 
paper we have followed the variable metric convention throughout so our expressions for conjugate gradient updates 
are disguised. In what follows we make several changes in the appearance of Eq.(5.5) and Eq.(5.6) relying only on the 
orthogonality of a gradient with the most recent displacement, Eq.(2.3), which is always true regardless if the energy 
function is quadratic or not. The resulting formulas are the conventional expressions for the eg formulas that appear 
in most text books. 

First we can simplify the denominator in the Polak-Ribiere routine. 

x„+i = x„ — A„ H ■ V£ x„ + x„ - x„_iP - '' — ^ — '- Al) 

V (x„ - x„_i) ■ V£(x„_i) J 

One is also always free, from the definition Eq.(2.1), to express the update scheme in terms of h„_i instead of 

Xn X n _i . 

x„ tl . x „ - A„ ( H ■ VE M + ^.. gM^gjl^ ) m 
V h n _ r V%-i) J 
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Next we expand the denominator using the eg form for h n _i. 

h n _i • V£;(x„_i) = H V£(x n _i) + h„_ 2 r ; ■ V£/(x n _i) 

V h n-2 • V-fr(X n _iJ / 

= — V-E'(x„_ 1 ) • H • V-B(x„_!), (A3) 
where the second term was zero by orthogonality, Eq.(2.3). The update formula now looks like 

, /„ V„ f v , (V^(Xn) - VfltXn-!)) • H ■ V£(x») \ 

This is the formula that is conventionally used in implementations of the Polak-Ribiere update formula. 
Similarly, the conventional Fletcher-Reeves update is 

= - - *. ( H ■ - h -- vJ(S:g:^i',) ) • < A5 » 

which we remind the reader is only equivalent to the Polak-Ribiere update formula for quadratic energy functions. 



APPENDIX B: PROOF THAT Hat IS POSITIVE DEFINITE PROVIDED Hjv_i IS POSITIVE DEFINITE 

When finding the minimum of a function it is important to know which way is down. As shown in section IV, 
for conjugate gradient and variable metric minimization, downhill is in the direction of positive A„ provided that 
H„ is positive definite. For conjugate gradient minimization H„ does not change with n so positive definiteness is 
trivially guaranteed by making H positive definite initially. For variable metric minimization, on the other hand H„ 
is updated at each step so it is not obvious that if Hi is chosen to be positive definite that H„ will be too. 

We will simultaneously show that the Broyden-Flctcher-Goldfarb-Shanno and Davidon-Fletcher-Powell forms of 
H„ are positive definite provided that H n _i is positive definite, regardless if the energy is quadratic or not. We start 
with the BFGS expression for H„, Eq.(6.7), and the DFP expression for H„, Eq.(6.9) written in the alternative form 8 

H„ = (1 - St) • H„_i • (1 - S) + _ F . (X "' X " F 1 . )(2)(X "' X "- 1) r, (Bl) 

(V£(x„) - V£(x„_i)) • (x„ - x„_i) 

where 

g _ (V£(X») - Vg^Cn-i)) 0(X„ - Xn-j) 

(V£(x n ) - V£7(x„_i)) • (x„ - x„_i) 
= (V-E(xn) - V^(x»_ 1 ))0(V^(x n ) - V J E(x n _i))H n _i 
(V£(x n ) - V J B(x„_ 1 )) • H„_! • (V£(x„) - V£(x„-i)) 

To prove positive definiteness we show that v • H„ • v > for arbitrary v. 



DFP. (B2) 



tr tr , (v • (x„ - x„_i 2 

vH„-v = r-H„_i-r+— — — — — — -, (B3) 

(V£(x„) - V^(x„_i)) • (x„ - x„_i) 

in which 

r = (1 - S) • v. (B4) 

The denominator of the last term is always greater than zero provided that H„_i is positive definite, 

(V%)-V£(x n _i))-( ) = A n _ 1 V£(x„_ 1 ) ■ H„_! • V J B(x„_ 1 ) > 0. (B5) 

It is also the case that r and v • (x„ — x„_i) can not both be zero at the same time if v ^ 0. This is because r = only 
when v is proportional to (\7E(x n ) — V-E(x n _i)) which makes v • (x„ — x„_i) unequal to zero via Eq.(B5). Therefore, 
assuming H„_i is positive definite, Eq.(B3) is the sum of two nonnegative numbers, of which at least one must be 
greater than zero. Hence H„ is positive definite provided that H„_i is positive definite. Consequently the positive 
A n direction points down hill for the variable metric routines just as it does for the conjugate gradient routines. 
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APPENDIX C: PROOF THAT ALL OF THE VARIABLE METRIC AND CONJUGATE GRADIENT 
SCHEMES ARE IDENTICAL FOR QUADRATIC FUNCTIONS 



In this section we show that all of the variable metric and conjugate gradient schemes produce the same sequence 
of x„'s for quadratic functions. We have already shown that the Polak-Ribiere and the Fletcher Reeves conjugate 
gradient routines are equivalent for quadratic functions. We will now show that the variable metric routines are 
equivalent to conjugate gradient routines, and that therefore they are also equivalent to each other. The proof is 
inductive and relies on the following proportionality relations for descent directions 

H„ • V£(x„) oc H fe • V£(x„) - (x„ - ^-O ^^.v^^)).^.^) * < n. (CI) 

The weaker requirement of proportionality rather than equality is sufficient to guarantee that the descent directions 
are the same for different fc; the equality of the displacements and therefore the x„'s is recovered when X n is varied 
numerically. This is trivially satisfied for k = n — 1 in the Broyden-Fletcher-Goldfarb-Shanno updating formula 
where the proportionality is the equality in Eq.(6.5), and was the motivation for the BFGS update scheme. The 
proportionality is also true for k = n — 1 for the Davidon-Flctchcr-Powcll updating formula as we will prove below. 
Putting off the k = n — 1 proof for the DFP formula for now, we will show that for either scheme if Eq. (CI) is satisfied 
for Hfc then it is also true for Hk-i- Once this is proven the equivalence between variable metric and conjugate 
gradient optimization is achieved by setting k = 1. Plugging either the BFGS expression for Hfc, Eq.(6.7), or the 
DFP expression for Hfc, Eq.(6.9), into Eq.(Cl) gets us back Eq.(Cl) with Hfc_i provided that E is quadratic (so that 
we may make liberal use of Eq.(3.3)) and that 

(V%) - V£(x fc _i)) • H fe _! • V£(x n ) = 0. k<n. (C2) 

The second part of this expression is zero via Eq.(3.3) since £?(xfc_i) • Hfc_i is proportional to the displacement 
Xfc — Xfc_i. To show that -E(xfc) • Hfc_i is also proportional to a linear combinations of displacements depends on 
Eq.(Cl) being true for k = n — 1. If it is then 

„ P , x , , (v^(xfc)-vii;(xfc_ 1 ))-Hfc_ 1 -vj ; ;(xfc) 

Hfc_r • = - x fc _ l} (V£(xfe) _ v^)) . (xfe _ Xfe _ l} + 7(x k+1 - k), (C3) 

where 7 is a proportionality constant. This is zero when dotted into Vi?(x n ) so the theorem is proved. Setting k = 1 
in the displacement formula Eq.(Cl) completes the proof of equivalence of the BFGS variable metric and conjugate 
gradient optimization for quadratic functions. To show that DFP variable metric optimization is also equivalent to 
conjugate gradient optimization for quadratic functions we must show that Eq.(Cl) is satisfied for k = n— 1 using the 
DFP H„. To do so we operate with, Eq.(6.9), the DFP formula for H„ in terms of H„_i, on the gradient VE(x n ). 
The result is 

H„ • V£(x„) = H^ • V%) - H„_i • (VE( Xn ) Vi?^)) V^O^-i ■ V^(x n) 



(V£;(x„) - V£(x n _i)) • H„_! • (WE{xn) - V£(x„-i)) 

(C4) 



Combining terms we get 



VE(x n ) ■ H n _r • V^x,,) V (V£(x„) - V^(x„_!)) • (x n - x n _r) 

(C5) 

where we have used the fact that H n _i • V-E(x„_i) is proportional to x„ — x„_i, and we have used Eq.(3.3) to simplify 
the proportionality prefactor. 

This completes the proof that all of the standard conjugate gradient and variable metric optimization routines 
produce exactly the same sequence of x n 's for quadratic functions. We can use this fact to prove fast convergence of 
all of these routines for the case of quadratic functions merely by showing that it is true for one. This is what we do 
in the next section of the appendix. 



9 



APPENDIX D: PROOF OF FAST CONVERGENCE FOR QUADRATIC FUNCTIONS 



In this section we show that optimization with Vj = — H • VE(xj) converges to the minimum of quadratic functions 
in the number of steps equal to the number of distinct eigenvalues of H • A, regardless of the dimension of the 
configuration space. Since conjugate gradient and variable metric minimization is the same for quadratic functions, 
it is sufficient to show this for the conjugate gradient minimization. The proof is based on the one step minimization 
formula 

x mm =x- A- 1 VS(x), (Dl) 

and the little known fact, which we prove below, that if a matrix M has p distinct eigenvalues then M _1 is a polynomial 
in M with highest power p — 1, 

p-i 

M : » iNIi ''- (D2) 

3=0 



With M = HA we have 



Plugging this into Eq.(Dl) we have 



p-i 

A-^^c^HAjiH. (D3) 



p-i 



x min = x - c,'(HA) 3 'H • V£(x). (D4) 

1=0 

In this appendix we prove that for quadratic functions, and for HA with p distinct eigenvalues, Eq.(D4) is equal to 
the conjugate gradient expression after p steps, not including the first guess. 
First we prove Eq.(D2). For M with p distinct eigenvalues we have 

f[(e j -M) = (D5) 
where £j arc the distinct eigenvalue of M. Therefore the identity can be written 

I = I-/?nte-M) (°6) 

for any fi. Consequently 

p 

IVT 1 = (I-/?IJ(ej -M^Mr 1 (D7) 

j=l 

for any (3. In general this is an implicit equation for M _1 but with j3 = 1/ IIaUi e k eacn term in parenthesis contains 
at least one power of M, cancelling out M _1 on the right hand side. The final result expresses M" 1 explicitly as a 
polynomial in M alone with highest power p — 1, as alleged in Eq.(D2). The results for p from one to four are 

M^ 1 = —I (D8) 
ei 

M^ 1 =(- + -) I -—M 2 (D9) 



M3- 1 = (- + - + -)!-(— + — + —) M 3 + — (D10) 



1 

— + 








£l£2 


1 

— + 


1 1 

£2 £3 . 






1 

— + 

ei 


1 1 
— + — 

£2 £3 




1 


1 

- + 


— + 



\£l£2 £l£3 £2£3 / £l£2£3 

M , ' (— + — + — + — )l — ( — 1 1 1 1 1 — ^ M 4 

\ e i £ 2 £l£3 £l£4 £2£3 £2£4 £3£4/ 

+ (— + — + — + — Ml (Dll) 

£l£2£3 £l£2£4 £l£3£4 £2£3£4/ £l£2£3£4 
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Next we show that conjugate gradient minimization is equal to Eq.(D4) after p steps. The proof is inductive. We 
will prove that for conjugate gradient minimization 

x„ = Xl + P„_ 2 (HA)HV£( Xl ) (D12) 

where P„_ 2 (HA) is a polynomial of order n — 2. First, it is trivially true for n — 2 since x 2 = xi — HV-E(xi). We 
will now prove that if it is true for the pth step then it is also true for the p+lth step where 

p 

x p+ i = Xp - A P H • V£(x„) + A p y^ Oj(xj - Xj_i), (D13) 

in which the a/s are the coefficients of your favorite conjugate gradient routine. Using Eq.(3.6) we can put things in 
terms of xi. 

p 

x p+ i = xi + (x p - xi) - ApH • (V(£(xi) + A(xp - xi)) + A p ^a J ((x j - x x ) - (xj_i - xi)) (D14) 

1=2 

Therefore using Eq.(D12) we have 

xp+i = xi + ^-ApHAPp_ 2 (HA) + Pp- 2 (HA) + X p ^ a p (Pj-2 (HA) - P r3 (HA)) - A p j • H • (V£(xi), (D15) 

where P_i = 0. The expression in parenthesis, because of its first term, is a polynomial of order p — 1. Indeed it can 
be any polynomial of order p — 1 since A p and the p-1 Oj's in principle can be varied arbitrarily. In particular it can 
be the polynomial in Eq.(D4). The conjugate gradient formula picks out the polynomial of order p—1 that minimizes 
E(x p+ i). Since E is minimized for x p+i = x m j n , Eq.(D15) must equal Eq.(D4) and the theorem is proved. 



1 The scooter represents a routine that minimizes the function along a particular ray. A real line minimization routine moves 
quite unlike the motion of any real scooter, more like several pogo sticks bouncing along the ray until the minimum is found. 
If there is more than one minima, such routines can not guarantee that the minima found is the closest one to your starting 
point. References for line minimization routines can be found the next citation. 

2 Some texts, in addition to those listed in the next citation, that describe minimizing a function of one variable are : Ale- 
jandro L. Garcia, Numerical Methods for Physics (Prentice Hall 1994), and Forman S. Acton Numerical Methods that Work 
(Mathematical Association of America) 

3 Several texts that describe conjugate gradient and variable metric minimization are: Jorge Nocedal and Stephen J. Wright, 
Numerical Optimization (Springer, New York, 1999), corrected second printing, Elijah Polak, Optimization: Algorithms 
and Consistent Approximations, (Springer, New York, 1997) , William H. Press, Brian P. Flannery, Saul A. Teukolsky, 
Wiiliam T. Vetterling Numerical Recipes in Fortran/ C(++): The Art of Scientific Computing, (www.nr.com). Also see 
Jonathan Richard Shewchuk, "An Introduction to the Conjugate Gradient Method without the agonizing pain," www- 
2 . cs . emu .edu/ jrs/jrspapers.html. 

4 R Fletcher CM. Reeves, Comput. J. 7 (2), 149-154 (1964) 

5 E Polak and G. Ribiere, Rev. Fr. Inform. Rech. Operation. (16-R1) 35-43 (1969) 

6 W. C. Davidon, AEC Research and Development Rept. ANL 5900 (Rev.) 1959., R. Fletcher and M. J. D. Powell, Comput. 
J. 6, 163-168 (1963) 

7 C. G. Broyden, J. Inst. Mat. Appl. 6 222-231 1970, R. Fletcher Comp. J., 13 317-322 (1970), D. Goldfarb Mat. Comp. 24 
23-26 (1970), D. F. Shanno, Mat. Comp. 24 647-656 (1970), D. F. Shanno, J. Opt. Th. Appl., 46 (1) 87-94 (1985) 
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